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Abstract 

The Trotter-Suzuki approximation leads to an efficient algorithm for solving the time- 
dependent Schrodinger equation. Using existing highly optimized CPU and GPU kernels, 
we developed a distributed version of the algorithm that runs efficiently on a cluster. Our 
implementation also improves single node performance, and is able to use multiple GPUs 
within a node. The scaling is close to linear using the CPU kernels, whereas the efficiency of 
GPU kernels improve with larger matrices. We also introduce a hybrid kernel that simulta- 
neously uses multicore CPUs and GPUs in a distributed system. This kernel is shown to be 
efficient when the matrix size would not fit in the GPU memory. Larger quantum systems 
scale especially well with a high number nodes. The code is available under an open source 
license. 

Keywords: GPU Computing, MPI, Hamiltonian, Quantum Evolution, Trotter-Suzuki 
Algorithm, Hybrid Kernel 



1. Introduction 

The evolution of a general quantum sys- 
tem is described by the time-dependent 
Schrodinger equation. The generic solution 
of this equation involves calculating a ma- 
trix exponential, which is formally simple. 
However, computer implementations must 
consider several factors to achieve high per- 
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formance and high accuracy - usually mak- 
ing a trade-off between these two indicators. 

There is a wide range of numerical ap- 
proaches to calculating a matrix exponen- 
tial. However, since they are approximate, 
not all may preserve some desired ana- 
lytical property of the original matrices. 
This is crucial for example with the quan- 
tum evolution operator - the exponential 
of the Hamiltonian matrix appearing in the 
Schrodinger equation, - which must be uni- 
tary in order to conserve the total probabil- 
ity. The reference method for calculating a 
matrix exponential is to diagonalize the ma- 
trix using an eigendecomposition, which is 
typically computationally intensive. While 
efficient eigendecomposition algorithms ex- 
ist that use multicore CPU and multiple 
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GPUs in a system |7j. distributed variants 
that use several computer nodes in a cluster 
are hard to parallelize with close to ideal ef- 
ficiency [10J. Traditional numerical integra- 
tors like the Runge-Kutta algorithm do not 
conserve unitarity, and unitary algorithms 
like the Crank- Nicholson scheme involve in- 
verting a large matrix. 

The Trotter- Suzuki algorithm approaches 
the problem through a slightly different an- 
gle. It decomposes the Hamiltonian as a 
sum of matrices that are easy to exponen- 
tiate |5J, which are then used to approxi- 
mate the exponential of the full Hamilto- 
nian. The end result is an algorithm that is 
easy to parallelize. For the case of a single 
particle in real space that we treat here, the 
algorithm discretizes the domain with a fi- 
nite mesh and calculates the pairwise evolu- 
tion between neighboring sites in the mesh. 
The Trotter-Suzuki algorithm has been suc- 
cessfully used 0, El E] . Efficient kernels for 
contemporary multicore CPUs and GPUs 
have already been developed p]. 

This paper introduces a distributed vari- 
ant of the Trotter-Suzuki algorithm using 
existing high-performance kernels. Our im- 
plementation improves the single-node effi- 
ciency of the CPU and GPU kernels, it is 
able to use multiple GPUs in a single sys- 
tem, and also introduces a hybrid kernel to 
deal with large quantum systems that do 
not fit the GPU memory. The implementa- 
tion also scales in a cluster, the CPU vari- 
ant shows an almost linear speedup with the 
number of nodes, whereas the GPU vari- 
ant is more efficient when the device has a 
higher load. 

The rest of this paper is organized as fol- 
lows. Section [2] gives a brief overview of 
the foundations of the Trotter-Suzuki algo- 
rithm. Section [3] provides the details of the 
distributed kernels, and Section [4] discusses 



the benchmark results on a large cluster. Fi- 
nally, Section [5] concludes the paper and of- 
fers an insight on our future work. 

2. Trotter-Suzuki Algorithm and Effi- 
cient Kernels 

The non-relativistic Schrodinger equation 
describing the evolution of a quantum sys- 
tem is 

where T-L(t) is the Hamiltonian of the sys- 
tem, and \ip(t)) the state of the system at 
time t. By choosing a basis, the Hamilto- 
nian can be written as a Hermitian matrix 
H . The formal solution to the Schrodinger 
equation for a time-independent Hamilto- 
nian [j is given by 



\m) = u(t)\m), (2) 



where \ip(0)) is the initial state of the sys- 
tem, with norm \(ip(0)\ip(0))\ 2 = 1, and 
U (t) is the quantum evolution operator as- 
sociated to Hamiltonian H. Since H is 
Hermitian, it is easy to see that the evo- 
lution operator is unitary, and that the 
norm of the state vector is constant over 
time, |^(t)|^))| 2 = |(^(0)|f/t^(0))| 2 = 
| (^(0)|^(0)) | 2 = 1 Thus, it is crucial that 
the numerical solution of the evolution oper- 
ator be unitary [5], or the norm of the wave 
function - which gives the total probability 
of finding the particle somewhere, and must 
equal one - would not be conserved. 

The Trotter-Suzuki algorithm decom- 
poses the Hamiltonian into small diagonal 



^^The general time-dependent form of the 
Trotter-Suzuki algorithm is a trivial extension of 
the one we present [IT] . 



or block diagonal matrices, where the expo- 
nential is easy to compute. The decomposi- 
tion is based on the Trotter formula 1151: 
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where A and B are M x M matrices, 
sufficiently large n, e x ( A + B )/ n 
The Trotter formula is readily generalized 
to the case of more than two contributions 
to H by writing H = X^=i H%- This allows 
for choosing a decomposition that can be 
exploited to construct efficient algorithms. 
The error of the Trotter formula scales as 
(x/n) 2 times the norm of the commutator 
between A and B. Higher order approx- 
imations were later developed by Suzuki 
[T51 [T3] . who obtained expressions that are 
unitary for all orders. 

To explain the algorithm, let us assume a 
one- dimensional Hamiltonian with the form 
H = — j§^;|^ + 'l /7 ( a; )> where m is the mass of 
the particle, and V(x) the potential energy. 
Discretizing the Schrodinger equation using 
a finite mesh of points spaced by a distance 
a, the Hamiltonian matrix H is like that 
of a tight-binding chain - a tridiagonal ma- 
trix. Such a matrix can be split as a sum of 
a diagonal matrix, and two block-diagonal 
commuting matrices made up of 2 x 2 ma- 
trices: 

H = H + Hx + H 2 , 

where the components are as follows. The 
diagonal matrix Hq is written as 



fe x 
e 2 

V 



\ 



e L J 



where q = V(na) + H 2 /ma 2 is the effective 
energy at site I, L is the number of sites. 



The block diagonal matrices Hi and H 2 are 
written as 
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where V = —h 2 /2ma 2 is the tunneling ele- 
ment between the sites. The exponential of 
a block matrix is itself a block matrix build 
from exponentials of 2 x 2 matrices. These 
plane rotation matrices can be written as 
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where At is the discrete time step. Using 
the above decomposition, the first-order ap- 
proximation of the unitary time step evolu- 
tion operator is given by 



Ui (At) 



Approximants correct up to second order 
are obtained by symmetrization [5l [12] : 



U 2 (At) = Uf (At) Ui (At) 



(3) 



where A T is the transpose of matrix A. Ex- 
tending the algorithm for more than one 
dimension is also straightforward, as we 
can simply perform a decomposition of the 



Hamiltonian into five parts: the diagonal 
energies, and two terms for each dimen- 
sion. Our implementation is based on the 
second order formulation of Equation [3j 
Higher order approximants are expressed as 
a sequence of applications of the first and 
second order operators. For example, the 
fourth order evolution operator is 

1/4 (At) = U 2 (pAt)U 2 (pAt) 
C/ 2 ((l-4p)At) 
U 2 (pAt) U 2 (pAt) , (4) 

where p = (4 - 4 1 / 3 )- 1 [13]. 

Because of its structure, the cost of ap- 
plying any order of the Trotter-Suzuki op- 
erator scales linearly in time and memory. 
A general external potential is straightfor- 
ward to implement and always adds a cost L 
in time. Therefore, we perform our bench- 
marks assuming a flat potential landscape, 
V(x) = constant, which leads to H induc- 
ing a global phase factor in the wave func- 
tion which we can ignore. 

Notice in the above discussion that at no 
point an assumption was made about the 
"importance" of a particular contribution to 
the Hamiltonian. This is the reason why the 
Trotter-Suzuki approach can be used where 
perturbation methods break down [5]. 

3. Distributing the Workload Across 
a Cluster 

We took the optimized CPU and GPU 
kernels of [1] as our starting point. These 
kernels use a double-buffered data ac- 
cess pattern: the result is not calcu- 
lated in place, but to a new buffer, and 
when a new iteration starts, the buffers 
are swapped. There are two CPU ker- 
nels: cache-optimized, and another cache- 
optimized that is further optimized to use 
the SSE instruction set of the CPU. 
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Figure 1 : A general overview of communication and 
computation pattern within one iteration in the dis- 
tributed implementation 



Cache optimization means that the in- 
put data is divided into blocks. A similar 
block division is present in the GPU ker- 
nel, where, instead of a hardware cache, the 
shared memory of the simultaneous multi- 
processors is used to fetch and explicitly 
cache data. This block division results in 
extra calculations: a halo for each block has 
to be computed to get the correct results for 
the internal cells. This extra work pays off 
by the benefit of cached access to the data. 

The unit of calculation of our distributed 
version is a process. Using a similar com- 
putational structure to the above, we refer 
to a block assigned to a process as a tile. 
The block division in a single node is sim- 
ple: the halo computed by a block is sim- 
ply discarded, and the next block reads its 
own halo from the main memory. In a dis- 
tributed version the halos between the tiles 
have to be sent across the processes. Using 
a two-dimensional grid of processes, a tile 
contains elements of halos belonging to a to- 
tal of eight other tiles: left, right, top, and 
bottom neighbours, and also the four diag- 
onal neighbours. To minimize the number 
of communication requests, a wave pattern 



is used in the communication: left and right 
neighbours receive the halo first. This halo 
has the height of the inner cells of the tile 
(see Figure [T|. Then the horizontal halos 
are sent to the top and bottom neighbours 
- the width is the full tile width. In this way 
the appropriate corner elements are propa- 
gated to the diagonal neighbours. 

Communication is performed asyn- 
chronously, but there is a communication 
barrier between the left-right and the 
top-bottom halo exchanges due to data 
dependency. The generic approach to 
computation and calculations attempts to 
overlap the two by as much as possible, 
calculating the halo first and starting 
the communication simultaneously to the 
calculation of the rest of the evolution step 
(Figure [T]). 

The step of starting the halo exchange 
initiates the asynchronous left-right halo ex- 
change. The last step, finishing the halo ex- 
change, has a communication barrier wait- 
ing for the first exchange to finish. Af- 
ter this, it initiates the asynchronous top- 
bottom halo exchange, and has a second 
barrier, waiting for this communication to 
finish. There are variations to this pattern, 
as detailed below. 

3.1. The CPU kernels 

The cache optimized kernels of [T] used 
OpenMP, a directive-driven parallelization 
of the code to use the power multicore archi- 
tecture [1] . We found that using our explicit 
parallelization, the overall performance was 
better by 30 % if we used the same number 
of process within a node as there are cores. 
Hence, even single node execution is accel- 
erated by our approach. This finding cor- 
responds well with other benchmarks that 
compare OpenMP parallelism with more ex- 
plicit forms of parallelism [8]. 



Finishing the halo exchange cannot be 
efficiently overlapped with communication. 
This means that once the computation of 
the iteration is completed, there is a com- 
munication overhead while the vertical ha- 
los are exchanged. 

3.2. The GPU kernel 

The GPU kernel had to be adjusted to 
work with communication. As it is cus- 
tomary in GPU programming, the device 
keeps the double-buffered matrices in its 
own memory, whereas MPI communication 
is performed from the host memory. This 
increases complexity as asynchronous mem- 
ory copies from the device and to the device 
have to be performed. 

To work with such transfer efficiently, we 
implemented the kernel with streams. A 
GPU stream is basically a queue of tasks 
for the GPU to perform: kernel execu- 
tion, memory copies from and to the device. 
Given two streams, memory copies in one 
stream can overlap with computation in the 
other stream. We use two streams, queue- 
ing the halo computation and the memory 
copies in stream one, and the computation 
of the rest of cells in stream two. Since a 
kernel launch with streams returns the con- 
trol to the calling process immediately, this 
also means that while stream two is execut- 
ing, every halo exchange can be completed 
before the iteration is finished. Hence the 
distributed GPU version has a much bet- 
ter communication efficiency than the CPU 
variant. 

The communication routine typically 
does an internal buffering of the data to be 
sent. This extra memory copy is unseen 
by the user. To avoid it, we use pinned 
memory, a specific way to allocate host 
memory for data that will be exchanged 



with the GPU. Pinned memory avoids in- 
ternal buffering when it comes to communi- 
cation, hence further increasing communi- 
cation performance. 

We use one process per GPU and single- 
node execution with multiple GPUs is pos- 
sible. 



3.3. The hybrid kernel 



Using streams means most cores of a mul- 
ticore CPU idle while waiting for the GPU 
to complete the calculations. GPUs also 
have less memory, limiting the size of the 
quantum system for which the evolution is 
computed. To address these two issues, we 
introduce a hybrid kernel. 

The algorithm first calculates the maxi- 
mum amount of the tile that can be com- 
puted on the GPU assigned to the process. 
Then, using only one stream, it launches 
the kernel for the corresponding elements 
on the internal area of the tile. After the 
asynchronous kernel launches, it proceeds 
to calculate the halo and start the halo ex- 
change. Once the halo exchange is initiated, 
the elements not in the halo and not on the 
GPU are computed by the CPU. Finishing 
the halo exchange includes an extra step, an 
internal halo exchange between the part of 
the tile associated with the GPU and the 
rest of the matrix. 

This kernel also uses one process per 
GPU. This means that in an eight-core sys- 
tem with two GPUs, six cores would be idle. 
To utilize the power of the extra cores, we 
use the same directive-driven parallelism as 
the original CPU kernels of [I], relying on 
OpenMP. Hence all cores and all GPUs con- 
tribute to the work. 
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Figure 2: Memory requirements of the Trotter- 
Suzuki algorithm on a single node 



4. Discussion 

4-1. Experimental configuration 

The implementation of the distributed al- 
gorithm used MPI for communication^} We 
used bullx MPI, which is compatible with 
the MPI 2.1 standard, and it is built around 
OpenMPI. The compiler was the Intel Com- 
piler Chain, with all optimization turned on 
and OpenMP enabled. While these are pro- 
prietary tools, the code can also be compiled 
with open source software, such as Open- 
MPI and GCC. The GPU code was im- 
plemented with CUDA and compiled with 
CUDA 4.0, running with the corresponding 
runtime. 

The benchmarks were performed on the 
Minotauro cluster at the Barcelona Su- 
percomputing Center. Every node has 
two Intel Xeon E5649 six-core processors 
with 12MB of cache memory, clocked at 
2.53GHz, running Linux operating system 
with 24 GByte of RAM memory. Every 
node is equipped with two NVIDIA M2090 



The source code is available at https:// 



github . com/peterwittek/trotter-suzuki-mpi 
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(a) Single precision (b) Double precision 

Figure 3: Execution time, linear system size: 24576 (single precision), 17472 (double precision) 




(a) Single precision 
Figure 4: Execution time, linear system size: 




(a) Single precision 



Figure 5: Execution time, linear system size: 

cards, each one with 512 CUDA cores and 6 
GByte of GDDR5 memory. The MPI com- 
munication across the nodes is through an 
Infiniband Network. 

Figure [2] illustrates the memory con- 
straints on this cluster with respect to the 
size of the quantum system. The bench- 




(b) Double precision 
49152 (single precision), 34944 (double precision) 




(b) Double precision 
98304 (single precision), 69888 (double precision) 

marks below were chosen so as to maximize 
the memory usage on different cluster sizes. 

4-2. Benchmark results 

The benchmarks ran ten iterations on in- 
creasing cluster sizes, using a synthetic state 
in a square domain of different lengths L, 




(b) CPU with SSE 



(c) GPU 




(d) Hybrid 

Figure 6: MPI traces on four nodes 
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Figure 7: Load balance on four nodes 



where L = 24576, 49152, and 98304 in sin- 
gle precision, and L = 17472, 34944, 69888 
in double precision. Note that since this is 
a two dimensional quantum system, the di- 
mension of the corresponding Hilbert space 
is L 2 , and the Hamiltonian matrices would 
have size L 2 x L 2 . The dimensions were 
chosen so as to fill the device memory on 
cluster sizes one, four, and sixteen nodes. 



GPUs have a better a performance when the 
load is higher, whereas CPUs are less sen- 
sitive to the load. Choosing the matrix di- 
mensions to fit the GPUs in certain config- 
urations shows how the overall GPU perfor- 
mance decays as the cluster size increases, 
and it is also fair with respect to the CPU 
kernels due to their insensitivity to the load. 
The timing results are plotted in Figures |3j 
[4j and [5j The results show only the time 
taken in the main loop of the evolution (as 
depicted in Figure [I]), as the initalization 
takes considerably different amounts of time 
with different types of kernels. 

The CPU kernels show an almost linear 
scaling: the execution time is divided by ap- 
proximately two as the cluster size is dou- 
bled. Communication overhead increases 
with the cluster size, so eventually the ad- 
vantage of SSE optimization vanishes with 
large clusters (see also the next section on 



parallel efficiency). 

The GPU kernel has a more interesting 
scaling pattern. When the device memory 
is loaded to at least 50 %, the scaling is close 
to linear, just as in the case of CPU kernels. 
Then the execution time of individual GPUs 
remains almost constant, the curve flattens 
out, and there is little benefit to gain by this 
kernel in large clusters. 

The hybrid kernel trails the GPU kernel 
in cases where the problem would fit the 
GPU memory, with the execution time be- 
ing marginally longer. The real advantage 
is in cases where the device memory is in- 
sufficient. In such cases, the speedup can be 
close to 2x in double precision compared to 
the CPU kernels. 

4-3. Parallel efficiency 

Analysing the MPI traces reveals impor- 
tant information about the communication 
patterns (Figure [6]). We restrict our atten- 
tion to four nodes alone, other cluster sizes 
show similar patterns. The CPU kernels 
communicate in an almost identical pattern 
irrespective of the SSE optimization. Both 
of them are quite close to being optimal. 
Without SSE optimization, the average load 
is 94 %, so on average, the communica- 
tion overhead is approximately 6 % (Fig- 
ure [7]). The variation of load is small, the 
average /maximum ratio is 0.97. The SSE- 
optimized kernel has slightly worse indica- 
tors, with an average load of 90 %. This in- 
dicates that as the computation gets more 
efficient, the communication becomes a bot- 
tleneck. 

The GPU kernel apparently has a very 
different pattern. Only a fraction of the 
processes do any kind of work, the ones 
that are associated with a GPU. The plot 
of the MPI trace does not show the time 
spent in the CUDA kernel, since the launch 



is asynchronous with streams. Having no 
computational load, the CPU spends most 
of its time communicating, resulting in an 
average load of barely 71 %, and an aver- 
age/maximum ratio of 0.72, meaning that 
there is little variation across the processes. 

The trace of the hybrid kernel is surpris- 
ing, as it looks similar to the GPU trace. 
While there are considerably more threads, 
the ones that are associated with GPU op- 
erations follow the same pattern as above, 
resulting in long waits. The rest of the 
threads, however, overlap communication 
extremely efficiently. The overall load bal- 
ance and parallel efficiency are very similar 
to the GPU case. 

5. Conclusions and Future Work 

In this paper we have shown a dis- 
tributed variant of the Trotter-Suzuki al- 
gorithm based on efficient kernel imple- 
mentations. We have improved the single- 
node efficiency of CPU kernels by replac- 
ing OpenMP directives by explicit MPI par- 
allelization, and the GPU kernel by using 
streams. We have shown that our algorithm 
scales almost ideally, and that our hybrid 
kernel is efficient for calculating the evolu- 
tion of large systems in smaller clusters. 

The implementation can be improved in 
different ways. The current breakdown of 
tasks is entirely manual and hard-coded: 
halo calculation, halo communication, and 
calculation of internal cells, the latter which 
might be split between CPU and GPU re- 
sources. When we regard the computations 
alone, there is a well-defined task, the cal- 
culation of a block. The block size is dif- 
ferent on the CPU and the GPU. The for- 
mer is larger, but it is an integer multiple of 
the GPU block size, so we can use the CPU 
block size as the unit of calculation. Our im- 
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plementation is a typical thread-parallel ap- 
proach, where heavy processes perform the 
work. Task-based parallelism is another ap- 
proach, and the task in our case would be 
the calculation of a block. OmpSs is a vari- 
ant of OpenMP which takes this approach 
to parallelism [2|. It handles heterogeneous 
hardware that includes GPUs and CPUs, 
and takes care of the memory copies based 
on explicitly expressed data dependencies. 
Asynchronous communication can also be 
defined as a task. Hence theoretically it is 
possible to have a hybrid approach that is 
not hard-coded, but the distribution of halo 
calculation and internal cell calculation is 
decided by the OmpSs runtime. This also 
means that part of the halo might be cal- 
culated by the GPU, and it should also be 
easier to work on a cluster where some nodes 
have GPUs and others do not. To achieve 
this flexibility, we are working on an OmpSs 
version of our implementation. 

Another clear direction is to extend our 
implementation to three dimensional sys- 
tems. The variety of possible decomposi- 
tion strategies in this case is large, and a 
flexible implementation would be very use- 
ful to test out the performance of the differ- 
ent choices. The extension to three dimen- 
sions is also motivated by recent theoretical 
and experimental developments in ultracold 
atomic gases [9], which could not be simu- 
lated in a single computer with enough pre- 
cision. 
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